SUMMARY

The latest trends in high-performance computing show an increasing demand for the efficient use of
large-scale multicore systems, so that highly compute-intensive applications can be executed
reasonably well. However, the degree of parallelism available at each multicore component can be
limited by poor utilization of the memory hierarchy. Indeed, the multicore architecture introduces
distinct features that are already observed in shared memory and distributed environments; for
example, subsets of cores can share different subsets of memory. To achieve high performance, it is
imperative that an application be carefully allocated to the available cores, based on a scheduling
model that considers the main performance bottlenecks, such as memory contention. In this paper, the
Multicore Cluster Model (MCM) is proposed, which captures the most relevant performance
characteristics of multicore systems, such as the influence of the memory hierarchy and contention.
Better performance was achieved when a load balance strategy for a Branch-and-Bound application
applied to the Set Partitioning Problem was based on MCM, showing its efficiency and applicability to
modern systems. Copyright © 0000 John Wiley & Sons, Ltd.
1. INTRODUCTION



Multicore architectures have become dominant due to the considerable enhancement they bring to
computing system performance. Multicores can be found in a variety of domains. Currently,
high performance platforms such as clusters are composed of multicore nodes or multicore clusters
connected by network channels. These modern platforms exhibit a hierarchical memory: cores that
belong to the same processor can share caches, cores belonging to different processors share main
memory (such as RAM or DRAM) and cores that belong to different nodes do not share any memory
resource [1][2].

Parallel applications can benefit from this memory hierarchy to improve performance. Using cache
as shared memory can reduce the communication time between the tasks of an application;
therefore, tasks that communicate more frequently should be placed on cores that share a cache,
avoiding communication through main memory or message passing over the network [2][3][4]. However,
depending on the amount of memory required for communicating and computing tasks, allocating
tasks to many cores that share a cache may exceed its capacity, causing too many
accesses to main memory. These accesses can create a bottleneck in the memory channels and worsen
application performance [1][4][5][6].

Using the characteristics of the environment to improve application performance is not new. To
do so, it is necessary to define models that represent the most relevant features of the environment
where the application will run. This is not an easy task, and scheduling algorithms or load
balance strategies should be based on such a model in order to provide better application runtimes.

This paper proposes the Multicore Cluster Model (MCM), which is based on an extensive set of
experiments with a synthetic application that identifies the potential bottlenecks caused by sharing
memory resources and their impact when executing computation and communication tasks. The
model considers three levels of communication: i) communication through shared memory
via intra-chip cache, ii) communication through inter-chip shared memory and iii) communication
between cluster nodes via messages. Scheduling and load balance strategies should be adjusted to the
architecture model and the characteristics of the application, so that the application takes maximum
advantage of the execution environment. Along these lines, a load balance strategy for a class of
branch-and-bound applications based on MCM is also proposed.

In order to evaluate and validate our proposals, a parallel branch-and-bound algorithm applied 
to the set partitioning problem (PBBspp) was developed based on a load balance mechanism 
also introduced here. The experiments confirm that the model represents relevant features of the 
architecture which affect the application performance. The results showed that when memory access 
bottlenecks are avoided, the execution time of PBBspp can be improved by up to 70%. 

In summary, the main contributions of this work are the following:

1. A new model that considers not only the relevant architectural characteristics of processing
and communication via different levels of memory and network in a multicore cluster, but
also how those characteristics are affected by the amount of memory required by the
application tasks. To this end, the impact of the quantity of memory required by processing
and communicating tasks on the execution and communication costs was measured and
modeled. The objective is to provide a model that adds to the typical processing and
communication costs the cost associated with contention at the different levels of memory.

2. Based on the model, a novel load balance strategy is proposed in which the memory
hierarchy is taken into account when communication takes place, and the quantity of data
allocated to each task is evaluated so that the workload is balanced, thereby avoiding memory
contention bottlenecks.

3. Finally, a real application based on the branch-and-bound algorithm was used to validate
the proposed work. In the related literature, there is a large number of papers about
parallel branch-and-bound, but, to the best of our knowledge, few of them were designed
to take advantage of a computing system with both shared and distributed memory. The
implementation of the parallel branch-and-bound used here was based on the proposed load
balance strategy.

The remainder of this paper is organized as follows. Section 2 presents the related literature about
high performance architecture models. A set of tests used to identify the relevant characteristics of 
multicore clusters and a new load balance mechanism based on the obtained results are introduced in 
Section 3. Section 4 presents the use of the proposed load balance strategy in a parallel branch-and- 
bound to solve the Set Partitioning Problem. Experimental results and analysis, aiming to evaluate 
the efficiency of the resulting application, are shown in Section 5. Section 6 concludes the paper. 

2. HIGH PERFORMANCE PLATFORMS MODELS 

Due to the variety of parallel and distributed architectures, it is difficult to define a precise and
yet general model of parallel computation. In an attempt to identify the current trend, this section
outlines models of parallel computation with the aim of identifying the relevant characteristics that
must be considered when executing parallel applications.



The distinction between distributed memory, where each processor has its own local memory, and
shared memory, where all processors have access to a common memory, is well established.
For many years, high performance computing was developed based on distributed systems, mainly
due to their scalability and their potential to solve much larger problems. At the same time, however,
in order to improve processor performance even further, architecture designers put more and more
processor cores on the same chip, bringing about the advent of multicore. In this case, good
performance relies on the ability of the software to exploit the shared memory hierarchy. To do so, it
is important to define a computation model that incorporates the parameters of parallel architectures
that are essential to characterize parallel systems.

2.1. Model for shared memory architecture 

The Parallel Random Access Machine (PRAM) model 121 consists of a number of processors, each 
of which computes one instruction in one time unit, on different data, synchronously, and then 
communicates via shared memory, also within one step [8]. The great acceptance of the PRAM
model by the theoretical community has been due to its simplicity and universality, and a large
number of parallel algorithms based on it have been designed. While the PRAM model is an idealized
one, it is unfortunately not realistic. Nevertheless, much research effort has been expended in
attempts to incorporate critical parameters of parallel systems, mainly the ones related to
communication overhead [9][10][11][12][13][14].

In the early 1990s, due to continuous technological advances in memory bandwidth and latency,
the use of shared memory became a reality. Since the program designer wishes to take full advantage
of the memory system, it is necessary to consider the time to access not only the local main memory
but also the other levels of memory. Aggarwal et al. [15] proposed the Hierarchical Memory
Model (HMM), designed to capture the effect of the memory hierarchy. HMM considers a random
access memory machine where access to memory location $x$ requires $\lceil \log x \rceil$ time
instead of the typical constant access time. An extension of HMM, the HMBT, was proposed in [16], in
which a block of consecutive locations can be copied in constant time after the initial access latency
is paid. However, neither model considers parallel machines. Thus, [17] introduced extensions of the
HMBT to model memory systems in which data transfers between memory levels may proceed concurrently.
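
As a rough illustration of the HMM cost function described above, the following Python sketch charges $\lceil \log_2 x \rceil$ time units for an access to location $x$. The base-2 logarithm, the numbering of locations from 1, and the function names are assumptions made for this example only; they are not taken from the original HMM definition.

```python
def hmm_access_cost(x: int) -> int:
    """Cost of touching memory location x under an HMM-style model.

    Assumption for this sketch: locations are numbered from 1 and the
    logarithm is base 2, so the cost is ceil(log2(x)).
    """
    if x < 1:
        raise ValueError("locations are numbered from 1 in this sketch")
    # For x >= 1, ceil(log2(x)) equals the bit length of x - 1 (exact, no floats).
    return (x - 1).bit_length()


def hmm_scan_cost(n: int) -> int:
    """Total cost of reading locations 1..n once, one access at a time."""
    return sum(hmm_access_cost(x) for x in range(1, n + 1))
```

Under these assumptions, data placed at low addresses is cheap to reach while distant locations become progressively more expensive, which is the effect that a constant-cost random access machine cannot express.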

In [18], the Parallel Memory Hierarchy (PMH) models a computer as a tree of memory
modules with processors at the leaves. Its main characteristic is the representation of the transfer
cost of a block of data between the tree nodes. In [19], the Uniform Memory Hierarchy (UMH) is
proposed, which captures the cost of data movement between different levels of the memory hierarchy.
Although the works mentioned above are two decades old, it is interesting to note that they already
point to characteristics of current architectures such as multicore clusters, especially the relative
impact of the memory hierarchy on application performance. This set of works, however, does not model
distributed and shared memory together.

Gibbons et al in IfTTI introduced the Queuing Shared Memory (QSM) model, which accounts 
for limited communication bandwidth while still providing a simple shared-memory abstraction. 
The QSM model consists of processors with individual private memory as well as a global shared 
memory. However this model ignores the memory hierarchy in a processor. 

2.2. Model for distributed memory architecture 

With the objective of designing scalable systems, distributed memory networks have become
the mainstream platform for specifying efficient solutions to very large problems.
However, the performance of these solutions can be affected by the bandwidth and latency
limitations of communication. Many researchers have evaluated the behavior of
distributed memory architectures with the aim of designing a general purpose parallel model. The
Distributed Memory Model consists of a set of processors (with local memory) connected by links
under some topology, and communication is carried out through message passing.

Among the attempts to address the communication cost in distributed memory systems, a couple of
models merit discussion. The delay model captures the delay of the communication between any two
processors, regardless of their distance in the network [20]. This model has been widely used to
represent distributed memory systems, incorporating issues like the heterogeneity of processors [21].

The absence of a standard model of parallel computation led many researchers to try to
establish a bridge between parallel applications and parallel machines. Valiant [22] defined the
Bulk-Synchronous Parallel (BSP) model, which represents a set of processing elements, their speed,
and the time between two synchronization events, which characterizes a superstep.
It is during each superstep that the computation of tasks and the delivery of messages between
processors are supposed to be carried out. In a continuous search for more accurate models, and with
the advent of computer clusters, studies led to the specification of HBSP [23] to model the
heterogeneity of the processors with respect to their speeds and capacities.

Due to the emergence of networks of workstations as high performance environments, the LogP
model [24] was proposed as a computational model that represents global characteristics of parallel
architectures, such as the number of processing elements, the latency of transmissions, the gap
between subsequent messages and the overhead of sending and receiving messages. The
key issues stated in the model were related to communication and non-synchronous computations.
Following this work, extended LogP models were proposed. For example, in the LogGP
model [25], the gap associated with the sending of long messages is represented more accurately,
while in LogGPS [26], the cost of the synchronization needed when sending
a long message under the MPI library is also captured. LoPC [27] addresses the contention problem
that arises when sending messages in multiprocessors, i.e., it considers the sharing of global memory
between processors. Regarding point-to-point communication (i.e., sending messages), which
requires moving data from the local memory of the source process to the local memory of the target
process, the models $\log_n P$ and $\log_3 P$ are proposed in [28]. These models include middleware
costs in the representation of distributed communication.
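
To make the LogP family more concrete, the sketch below computes the usual end-to-end time of a single point-to-point message under LogP (a short message) and under LogGP (a message of $k$ bytes). The class and function names are hypothetical, and the formulas are the standard textbook ones for these two models; the other variants mentioned above are not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogGPParams:
    L: float  # network latency
    o: float  # overhead paid by the sender and again by the receiver
    g: float  # minimum gap between consecutive short messages (unused below)
    G: float  # gap per byte for long messages (LogGP extension)


def logp_message_time(p: LogGPParams) -> float:
    """One short message: send overhead + latency + receive overhead."""
    return p.o + p.L + p.o


def loggp_message_time(p: LogGPParams, k: int) -> float:
    """One k-byte message: the k - 1 bytes after the first each cost G."""
    if k < 1:
        raise ValueError("message must contain at least one byte")
    return p.o + (k - 1) * p.G + p.L + p.o
```

With `k = 1` the LogGP expression reduces to the LogP one, which is why LogGP is described as a refinement for long messages rather than a different model.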

Note that, when comparing the BSP and LogP models, researchers have classified BSP
as a suitable abstraction for parallel application development, while LogP offers better resource
management [29][30].

Following the advent of computer clusters, [31][32][28] captured more precisely the sending and
receiving overheads and latency. In these works, the costs depend on the size of the transmitted
message, so they are not the same for every transmission.

Yet, architectural evolution has shown the benefits of a hybrid memory parallel system, where
distributed memory computers are composed of machines with shared memory. Thanks to current
technological advances, increasing the execution performance of parallel applications on multicore
systems has become a reality. Still, further improvements are possible by properly characterizing such
environments.

2.3. Multicore architectures - Models for distributed and shared memory architecture 

The current trend for clusters of multiprocessors is the use of multicore machines connected
by a network of some specific topology (as in a distributed memory multicomputer), thus defining
a hybrid memory architecture that supports a hierarchical memory system. At the first level of
the hierarchy, fine-grained applications can perform reasonably well, while the second
level efficiently supports coarse-grained applications. This hierarchical model of parallelism
may be very powerful for exploiting the natural parallelism found in a great variety of
applications.

Subsets of cores in a multicore machine may share different levels of memory. For example,
a small subset of cores usually shares an L2 cache, while another subset of higher cardinality may
share an L3 cache, and the global memory is shared by all the cores of the machine [33][34][35][36].
The modeling of such memory hierarchy sharing is still a challenge HI. 

Multicores cannot be treated merely as shared memory processors like conventional symmetric
multiprocessors (SMPs), mainly due to the design of multi-level cache hierarchies, which lead to
a reduction of the memory bottleneck. Therefore, application performance, particularly that of
parallel applications (whether they share data or exchange it via message passing), can potentially
benefit from a proper model of this architecture.



Typically, in shared memory models, sharing happens for all processors at the main memory
level. However, multicore processors have a varying degree of cache sharing at different levels.
The Unified Multicore Model (UMM) proposed in [35] assumes that sets of cores share first-level
caches, which in turn share second-level caches, and that the cache capacity is the same for all
caches at a given level. In that work, lower bounds are also derived for numerical applications, but
distributed memory is not taken into account.

The memory hierarchy should be captured across three levels of communication in a multicore
cluster: intra-processor, when communication takes place between two cores on the same processor;
inter-processor, when communication is carried out across processors but within the same machine;
and inter-machine, between two cores on different machines. For the same message size, [37][38]
captured distinct communication costs when communication takes place at different levels. More
specifically, [38] defines an analytical model that considers different memory levels and specifies an
affinity degree between threads, depending on the amount of data exchanged between them. Threads
with higher affinity should be allocated to cores that share a lower memory level (i.e., cache), in
order to avoid the higher communication costs incurred when these threads are on distinct processors.
In the latter case, recall that main memory is being shared. Nonetheless, this model does not consider
memory size, so too many threads can end up allocated to the same cache and, as a consequence,
the number of cache misses may increase [34][39]. The evaluation carried out in [34] on various
applications underlines the importance of accurately representing communication costs according to
the memory hierarchy: it suggested that intra- and inter-processor communication is
as important as inter-machine communication, and that data locality techniques that avoid memory
contention must be designed to improve application performance.

2.4. The application model 

The application model defines the characteristics relevant to application performance. An
application is usually represented by a directed acyclic graph (DAG), denoted by
$G = (V, E, \varepsilon, \omega)$, where: the set of $n$ vertices $V$ represents tasks; $E$, the
precedence relation among them; $\varepsilon(v)$ is the amount of work or computational weight
associated with task $v \in V$; and $\omega(u, v)$ is the communication weight associated with the
edge $(u, v) \in E$, representing the amount of data units transmitted from task $u$ to $v$. Also,
since in the target system considered in this work memory sharing is closely related to application
performance, the amount of data required by task $v$ must be represented and is denoted by $\mu(v)$.
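
The application model above can be sketched as a small data structure. The class and method names below are hypothetical and chosen only for illustration; the three dictionaries hold the computational weight, the memory requirement and the communication weight of the tasks and edges, and a topological sort checks that the precedence relation is acyclic.

```python
from dataclasses import dataclass, field


@dataclass
class TaskGraph:
    work: dict = field(default_factory=dict)    # computational weight of each task v
    memory: dict = field(default_factory=dict)  # amount of data required by each task v
    comm: dict = field(default_factory=dict)    # data sent along each edge (u, v)

    def add_task(self, v, work, memory):
        self.work[v] = work
        self.memory[v] = memory

    def add_edge(self, u, v, data):
        if u not in self.work or v not in self.work:
            raise KeyError("both endpoints must be existing tasks")
        self.comm[(u, v)] = data

    def pred(self, v):
        """Immediate predecessors of task v, in edge insertion order."""
        return [u for (u, w) in self.comm if w == v]

    def topological_order(self):
        """Kahn's algorithm; raises ValueError if the graph has a cycle."""
        indegree = {v: 0 for v in self.work}
        for (_, v) in self.comm:
            indegree[v] += 1
        ready = [v for v, d in indegree.items() if d == 0]
        order = []
        while ready:
            u = ready.pop()
            order.append(u)
            for (a, b) in self.comm:
                if a == u:
                    indegree[b] -= 1
                    if indegree[b] == 0:
                        ready.append(b)
        if len(order) != len(self.work):
            raise ValueError("precedence relation contains a cycle")
        return order
```

A scheduler built on such a structure would visit tasks in topological order so that every predecessor of a task is placed before the task itself.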



3. ON MODELING MULTICORE CLUSTERS 

In order to identify the influence of the relevant architectural characteristics on application
performance in multicore systems, a simple application based on [40][41] was used.
This application consists of two tasks that execute two well-defined phases: computation and
communication. The computation phase corresponds to two nested loops that scan a vector of
integers in steps of 1K bytes, so that hardware prefetching is avoided, since the step size is bigger
than any cache line and the cache size is a multiple of this step size [41]. The manner in which
the vector is accessed also avoids further optimizations carried out by the compiler, as discussed in 
M- 
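
The access pattern of the computation phase can be sketched as follows. This is only an illustration of a strided scan with two nested loops and a 1 KB step; the exact loop structure of the benchmark is not given in the text, so the choice of an outer loop over the starting offset and an inner loop over the strided positions is an assumption, and a real measurement of cache effects would need a compiled language rather than Python.

```python
from array import array

STEP_BYTES = 1024  # 1 KB step between consecutive accesses


def strided_scan(vector: array, step_bytes: int = STEP_BYTES) -> int:
    """Touch every element once, but never two neighbours in a row."""
    stride = step_bytes // vector.itemsize  # number of elements per step
    total = 0
    for offset in range(stride):                      # outer loop: start position
        for i in range(offset, len(vector), stride):  # inner loop: jump by one step
            total += vector[i]
    return total
```

Because consecutive accesses are one step apart, each access falls on a different cache line, which is the property the text relies on to defeat hardware prefetching.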

The communication phase consists of sending a message from one task to another: one task executes
a send command while the other executes a receive. The way this communication
is actually carried out depends on where the communicating tasks are allocated: if they are on the
same machine, communication takes place via shared memory, where semaphores are used to prevent
race conditions. Otherwise, a message is effectively transmitted.

All the experiments described in this section were executed on at least two machines of the
multicore cluster RIO, which has a Gigabit interconnection network. Each machine is a quad-core Intel
Xeon E5410 (Harpertown), each core with a private L1 cache of 64KB, and every two cores share a
12MB L2 cache in each one of the two processors of a machine. All four cores have uniform
access to a 16MB main memory module. The operating system is CentOS 5.3 with kernel version
2.6.18. The application is implemented with Intel MPI version 4.0.0.028, and threads were created
with POSIX. The PAPI tool [42] was used to collect and evaluate the execution performance of
the application.

In order to evaluate the influence of memory sharing during the execution of the application tasks
on the machine cores, the following allocations were defined:

i. two tasks allocated to the same core, and consequently accessing the same cache (SC);

ii. two tasks allocated to different cores that share the same cache (SCM);

iii. two tasks allocated to cores that do not share the same cache, but share the main memory
(SMM);

iv. two tasks allocated to cores of distinct machines (DM), where the global memory of each
machine is not shared.
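
The four allocations can be expressed as a simple classification of a pair of core placements. In the sketch below, the `Core` tuple and its field names are hypothetical; a placement is described by the machine, the shared cache within that machine, and the core.

```python
from typing import NamedTuple


class Core(NamedTuple):
    machine: int  # machine identifier
    cache: int    # identifier of the shared cache inside the machine
    core: int     # core identifier inside the machine


def classify(a: Core, b: Core) -> str:
    """Return SC, SCM, SMM or DM for two task placements."""
    if a.machine != b.machine:
        return "DM"   # distinct machines: nothing is shared
    if a.cache != b.cache:
        return "SMM"  # same machine, different caches: main memory only
    if a.core != b.core:
        return "SCM"  # different cores behind the same cache
    return "SC"       # same core, hence the same cache
```

For example, `classify(Core(0, 0, 0), Core(0, 0, 1))` yields `"SCM"`, while placing the second task on another machine yields `"DM"`.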

Let $\mu(v)$ be the vector size allocated by a task $v$ during the computation phase, as described
above. In order to enforce a given allocation of a task to a specific core, the system call
set_affinity() [3][43] was used; in addition, application tasks and system processes were not
executed on the same core.
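
Pinning a task to a core can be illustrated with the affinity interface that Python exposes on Linux, `os.sched_setaffinity`. This is an analogous mechanism chosen for this sketch, not necessarily the call used in the experiments, and the helper name `pin_to_cpu` is hypothetical.

```python
import os


def pin_to_cpu(cpu: int):
    """Restrict the calling process to one CPU.

    Returns the resulting set of allowed CPUs, or None on platforms
    where the affinity calls are not available (they are Linux-specific).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {cpu})  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)
```

Placing two tasks under the SCM or SMM allocation then amounts to pinning each of them to a CPU chosen according to the cache topology of the machine.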

3.1. Computation Phase Tasks 

In this experiment, two independent tasks $v_1$ and $v_2$, which do not communicate, were allocated
under the SC, SCM and SMM allocations only. Note that in this experiment, each task only performs
the two nested loops that scan the vector; no sending or receiving was specified.

It was observed that, even when the amount of data of both tasks is less than the cache capacity,
the SC allocation produced the worst execution times, as shown in Figure 1. This
is because, under SC, both tasks compete for the same computational
resource. When the amount of data allocated by each task is between 3MB and 6MB, the
SMM allocation provided the best performance: the combined data of both tasks, $\mu(v_1)$ and
$\mu(v_2)$, then exceeds the cache capacity, and the resulting cache misses degrade the execution
performance under SCM. Therefore, it is better to use SMM, that is, cores of the same
machine that do not share the L2 cache. Under SMM, the time can be reduced by 14.88%
compared with the SCM allocation (distinct cores, but same cache). For
$\mu(v_i) > 6$MB, each task alone needs more than the cache capacity and the number of global
memory accesses increases sharply.

It is important to note that, although the execution time for two tasks executed on the same core
(SC) is worse than for the other two allocations (SCM and SMM), the relative number of cache misses
is smaller than for SCM and SMM, as seen in Figure 1(a). The longer execution time is therefore due
to the sharing of the computational resource rather than of the cache memory.

Experiments with four and eight threads, also on two cores of the same machine, were
performed as well; the results can be seen in Figure 1(b) and (c). From the curves, one can see
that, although the overall execution time increased since more threads were allocated to the same
core, the same behavior as in the previous experiment was observed: SMM led to the best
performance, mainly for $\mu(v) > 3$MB, while SC was always the worst. Note that the number of cache
misses followed the same pattern as the one observed in Figure 1(a).

The results of another experiment can be seen in Figure 2, where $n = 2, 4, 8, 16, 32, 64$
threads were executed on one machine, divided among its cores. For $n \le 8$, no more than
one thread was executed per core, thereby avoiding the SC allocation. For
$n = 2, 4$, no cache was shared.

Some interesting conclusions can be drawn from this experiment. For more than 3MB per
thread, the higher the number of threads, the higher the application execution time, suggesting
that it is not worth executing more than one thread per processor. The bottom line is to allocate a
number of threads per machine whose data fits within the cache capacity.



Figure 1. Analysis on the execution of (a) 2, (b) 4 and (c) 8 threads in two cores on one machine.
Recovered panel titles: (a) Computation Phase - two threads on two cores on one machine;
(c) Computation Phase - eight threads on two cores on one machine.

Figure 2. Analysis on the execution from 2 to 64 threads in eight cores on one machine.



3.2. Communication phase 

In this experiment, the application consists of one computation phase and one communication phase,
as seen in Figure 3. It consists of two tasks or threads, $v$ and $u$, allocated under the SC, SCM,
SMM and DM allocations (DM is included to evaluate the influence of communication between distinct
machines). It is important to note that, whatever the allocation considered, the threads are
practically not executed in parallel due to the application topology. As shown in Table I,
the communication phase time with threads allocated to the same machine is practically negligible.

Figure 3. Computation phase - using more than two cores



The experiment was repeated by executing ten threads on two cores under the SC, SCM, SMM and
DM allocations. The application starts with one thread executing its computation phase on one core
and then sending a message to another thread allocated to another core. This thread, after receiving
the message and executing its computation phase, sends a message to another thread also allocated
to the first core. This pattern continues for the remaining threads, which, upon receiving a message,
execute the computation phase and then send a message to a different thread. Note that a thread
terminates as soon as it sends a message.

The results of this last experiment are shown in Table II and in Figure 4. They represent the
total execution times for message sizes $\omega(u, v)$ = 1MB, 4MB and 8MB, respectively,
where the x-axis of each graph corresponds to the vector size $\mu(v)$ of task $v$. From these
results, one can note that when the vector size $\mu(v)$ is less than 6MB, the worst results are those
produced by the DM allocation, since the communication cost associated with message transmission
inside the same machine is the smallest one. However, for $\mu(v) > 6$MB, the memory contention
problem may arise, depending on the size of the message being sent. The overall execution time is
slightly better for DM when the messages are smaller than 8MB, that is, $\omega(u, v) < 8$MB. Note
that an 8MB message cannot be considered very long given current network performance.
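
The crossover discussed above can be checked directly on the measurements. The sketch below transcribes the 1MB-message rows of Table II (total time for vector sizes from 1MB to 10MB) and reports, for each vector size, which allocation is fastest; the variable and function names are introduced here for illustration only.

```python
VECTOR_MB = list(range(1, 11))  # vector sizes of 1MB .. 10MB

# Total times from Table II, 1MB message, one value per vector size.
TOTAL_TIME_1MB_MSG = {
    "SC":  [1.3374, 2.7777, 4.0447, 5.7713, 12.2419,
            35.2310, 69.7924, 100.1719, 121.5122, 140.4271],
    "SCM": [1.3414, 2.7778, 4.0471, 5.7029, 13.3135,
            36.4155, 69.2166, 99.9814, 122.7039, 140.8029],
    "SMM": [1.3538, 2.8219, 4.1223, 6.4391, 13.5027,
            36.7180, 69.2976, 98.9662, 122.3274, 140.6743],
    "DM":  [2.1571, 3.5942, 4.9074, 6.6670, 14.7997,
            36.8324, 67.1933, 95.5452, 118.8403, 137.5865],
}


def fastest_allocation(times: dict, column: int) -> str:
    """Allocation with the smallest total time in the given column."""
    return min(times, key=lambda alloc: times[alloc][column])


def sizes_where_fastest(times: dict, alloc: str) -> list:
    """Vector sizes (in MB) for which `alloc` has the smallest total time."""
    return [VECTOR_MB[c] for c in range(len(VECTOR_MB))
            if fastest_allocation(times, c) == alloc]
```

For the 1MB message, DM is the slowest allocation up to a 6MB vector and becomes the fastest one from 7MB onwards, which matches the contention effect described in the text.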



Table I. Sending Time (rows: allocation; columns: vector size $\mu(v)$)

Alloc  1MB        2MB        3MB        4MB        5MB        6MB        7MB        8MB        9MB        10MB

$\omega(u,v)$ = 1MB message
SC     0.000002   0.000001   0.000002   0.000002   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001
SCM    0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001
SMM    0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001
DM     0.074050   0.074154   0.074178   0.074095   0.074185   0.074185   0.074200   0.074115   0.074162   0.074194

$\omega(u,v)$ = 4MB message
SC     0.000001   0.000002   0.000001   0.000002   0.000001   0.000001   0.000001   0.000001   0.000002   0.000001
SCM    0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001
SMM    0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000002   0.000001   0.000001
DM     0.345687   0.345788   0.345731   0.345706   0.345729   0.345763   0.345771   0.345788   0.345780   0.345723

$\omega(u,v)$ = 8MB message
SC     0.000001   0.000001   0.000001   0.000002   0.000001   0.000002   0.000001   0.000002   0.000001   0.000001
SCM    0.0000008  0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000002
SMM    0.0000014  0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000001   0.000002
DM     0.7022743  0.702273   0.702338   0.702368   0.702299   0.702308   0.702349   0.702374   0.702367   0.702362


3.3. Multicore Clusters Model - MCM 

In the light of the above analysis, this section describes the proposed Multicore Cluster Model
(MCM), where a multicore cluster $CM = \{M_0, M_1, M_2, \ldots, M_{m-1}\}$ is a set of $m$ machines.



Table II. Total Time - ten threads in one machine (rows: allocation; columns: vector size $\mu(v)$)

Alloc  1MB       2MB       3MB       4MB       5MB       6MB       7MB       8MB       9MB       10MB

$\omega(u,v)$ = 1MB message
SC     1.3374    2.7777    4.0447    5.7713    12.2419   35.2310   69.7924   100.1719  121.5122  140.4271
SCM    1.3414    2.7778    4.0471    5.7029    13.3135   36.4155   69.2166   99.9814   122.7039  140.8029
SMM    1.3538    2.8219    4.1223    6.4391    13.5027   36.7180   69.2976   98.9662   122.3274  140.6743
DM     2.1571    3.5942    4.9074    6.6670    14.7997   36.8324   67.1933   95.5452   118.8403  137.5865

$\omega(u,v)$ = 4MB message
SC     1.4811    2.9270    4.1948    5.9055    13.6012   34.0697   69.1819   99.6317   122.0155  140.5231
SCM    1.4850    2.9283    4.1952    6.2532    12.5493   37.0250   68.2627   100.3262  122.6561  140.5614
SMM    1.5016    2.9637    4.2477    5.9298    13.5898   37.7790   69.5724   99.4298   122.9522  140.6809
DM     4.7281    6.1448    7.4163    9.3374    16.7040   38.0295   71.5360   98.9790   121.9815  140.0319

$\omega(u,v)$ = 8MB message
SC     1.6758    3.1227    4.3917    6.0852    10.1448   29.0173   71.8652   101.9864  123.6068  140.8053
SCM    1.6788    3.1228    4.3935    6.2708    10.7288   31.3465   72.1163   102.8356  124.0472  140.7693
SMM    1.6958    3.1611    4.4446    6.1554    11.0589   33.9579   72.8210   102.1969  123.6122  141.1606
DM     8.1402    9.5474    10.8190   12.7849   18.1668   36.0094   75.6678   103.3968  125.3568  143.6539

Each machine $M_i$, $0 \le i < m$, consists of a set of $p$ processors
$\{P_{(i,0)}, P_{(i,1)}, \ldots, P_{(i,p-1)}\}$. In turn, each processor $P_{(i,j)}$ consists of a
set of $c$ cores, each one denoted by $C_{(i,j,k)}$.

Cores in the machine $M_i$ share the global main memory, $gm_i$, with capacity $gmc_i$, and cores in
the processor $P_{(i,j)}$ share a cache memory at a given level. Each processor $P_{(i,j)}$ in each
machine $M_i$ has a set of $l$ cache memories
$CM_i = \{cm_{(i,j,0)}, cm_{(i,j,1)}, \ldots, cm_{(i,j,l-1)}\}$. The capacity of each
cache $cm_{(i,j,k)}$ is denoted by $cmc_{(i,j,k)}$, such that $cmc_{(i,j,k)} < gmc_i$, i.e., the
capacity of each cache is smaller than that of the global memory.

Every two cores $C_{(i,j,k_1)}$ and $C_{(i,j,k_2)}$ that share a cache memory of processor
$P_{(i,j)}$ are called neighbor cores. Also, all the cores in a machine share the global memory $gm_i$.

All the cores in the machine $M_i$ have the same computational slowdown index $csi_i$, which is an
estimation of the computational power of each core in $M_i$, as defined in [21]. Therefore, MCM
models homogeneous cores inside a machine, but the machines are not necessarily homogeneous.
Thus, the execution time of task $v$ when it runs alone on a core, say $C_{(i,j,k)}$, is
$et(v, C_{(i,j,k)}) = csi_i \times \varepsilon(v)$.

Concerning the influence of the cache on application performance, this work defines the worst case
execution time of a task $v$ on a given core $C_{(i,j,k)}$ due to the number of cache misses that
might occur, which depends on the amount of memory already allocated. Hence, the execution time of
task $v$ is established not only by the computational slowdown index, but also by the amount of data
already allocated to $cm_{(i,j,k)}$ and to the main memory $gm_i$.

An edge $(u, v)$ represents the dependency between tasks $u$ and $v$ and also the exchange of information between them, whose amount is given by $\omega(u, v)$. The communication time to transmit this data between two machines, say $M_i$ and $M_j$, is then $ct((u, v), M_i, M_j) = \omega(u, v) \times lat(M_i, M_j)$, where $lat(M_i, M_j)$ is the communication latency associated with the link between $M_i$ and $M_j$.

Based on the previous tests related to the communication phase, MCM considers the communication cost inside a machine to be negligible.
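The two cost expressions of the model can be sketched in a few lines. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, the units are whatever the model's weights and latencies are measured in, and the intra-machine cost is set to zero following the simplification stated above.

```python
def execution_time(csi: float, weight: float) -> float:
    """Sole execution time of a task on a core: slowdown index times task weight."""
    return csi * weight


def communication_time(omega: float, latency: float, same_machine: bool) -> float:
    """Time to send omega units of data over a link with the given latency.

    Inside a machine the model treats the communication cost as negligible,
    so this sketch returns zero in that case.
    """
    if same_machine:
        return 0.0
    return omega * latency
```

For instance, a task of weight 3.0 on a core with slowdown index 2.0 yields an execution time of 6.0, and a message of 4.0 units over a link of latency 0.5 costs 2.0 between machines and nothing inside one.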

3.4. A Load Balance Model 

Figure 4. Total execution time (s) of ten threads under the SC, SCM, SMM and DM allocations (plot title: Total Time - messages with $\omega(u,v) = 1$ MB)

Regardless of the computational time associated with the sole execution of an application task in a core, this time is actually influenced by the number of tasks being executed on neighbor cores. Let $\varepsilon(v)$ be the computational weight of a task $v$, $\mu_v$ the amount of memory allocated when executing $v$, and $\omega(u, v)$ the communication weight from each one of the immediate predecessors $u \in pred(v)$. Suppose that $u$ is allocated to core $C_{(i_0,j_0,k_0)}$. The task $v$ is allocated to $C_{(i_1,j_1,k_1)}$, which is related to $C_{(i_0,j_0,k_0)}$ depending on the following conditions:

1. if $(\mu_u + \mu_v) < cmc_{(i_1,j_1,k_1)}$, the execution time of $v$ is the smallest if $i_0 = i_1$; that is, if the amount of data required by both $u$ and $v$ is smaller than the cache memory capacity, the computational time will be the smallest if both tasks are allocated on the same machine but on distinct cores, no matter whether the cache memory is shared or not. In the case $(u, v) \in E$, for $\omega(u, v) < LB_{msg}$, the total execution time of $v$ will be smaller if both tasks are executed in the same machine.

2. if $(\mu_u + \mu_v) > cmc_{(i_1,j_1,k_1)}$, the computation time of $v$ is smaller if $i_0 = i_1$, $j_0 \neq j_1$ and $k_0 \neq k_1$; that is, if the amount of data required by both $u$ and $v$ exceeds the cache capacity, the computation time of $v$ will be smaller if both tasks are executed on distinct cores of the same machine that do not share a cache (non-neighbor cores). In the case $(u, v) \in E$, the amount of data to be transmitted should be considered:

   (a) if it is bigger than the cache size, the computation time of $v$ is smaller if $i_0 = i_1$, that is, both tasks are executed in the same machine.

   (b) otherwise, when the communication message is smaller than the whole cache, $v$ should be allocated to a different machine, that is, $i_0 \neq i_1$.

Although conflicting, conditions 2.a and 2.b rely on the fact that it is cheaper to send small messages via network than to keep them locally. On the other hand, for long messages (in this work, no more than 8MB), memory contention for such messages is not as expensive as the communication time via network.
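The two conditions can be read as a small decision rule. The sketch below is one possible encoding, not the authors' implementation: the names `Placement` and `preferred_placement` are hypothetical, the threshold on small messages in condition 1 is ignored because that condition points to the same machine in every stated case, and the boundary case where the data exactly fills the cache (left unspecified by the model) is treated as condition 2 by assumption.

```python
from enum import Enum
from typing import Optional


class Placement(Enum):
    SAME_MACHINE_ANY_CORE = "same machine, distinct cores"
    SAME_MACHINE_NON_NEIGHBOR = "same machine, non-neighbor cores"
    DIFFERENT_MACHINE = "different machine"


def preferred_placement(mu_u: float, mu_v: float, cache_capacity: float,
                        omega: Optional[float] = None) -> Placement:
    """Where to place task v relative to its predecessor u.

    mu_u, mu_v: memory required by u and v; cache_capacity: capacity of the
    cache of the candidate core; omega: size of the message from u to v, or
    None when there is no edge (u, v).
    """
    if mu_u + mu_v < cache_capacity:
        # Condition 1: both working sets fit in cache.
        return Placement.SAME_MACHINE_ANY_CORE
    if omega is None or omega > cache_capacity:
        # Condition 2 without an edge, or condition 2(a): long message.
        return Placement.SAME_MACHINE_NON_NEIGHBOR
    # Condition 2(b): the message is not bigger than the cache.
    return Placement.DIFFERENT_MACHINE
```

All sizes must be expressed in the same unit; the rule only compares them.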



4. LOAD BALANCE OF A PARALLEL BRANCH-AND-BOUND BASED ON MCM 

In order to analyze and validate MCM, a load balance procedure based on the MCM model was developed in the context of a parallel branch-and-bound (PB&B) algorithm applied to the Set Partitioning Problem.

Branch-and-bound is a widely used technique for solving NP-hard optimization problems. Such 
algorithms search the space of solutions following a tree enumeration. As the computations along 
the subtrees can be accomplished almost independently, they are considered to be well suited for 
parallelism. 

There exists a variety of papers in the literature that propose parallel branch-and-bound algorithms or frameworks to ease their development for distributed [44] [45] [46] [47] and shared memory [45] [49] [50] [51] [52] [6] architectures. However, to the best of our knowledge, few of them explore both shared and distributed memory. Moreover, they do not consider the memory hierarchy of multicore processors in their solutions [53].

For a better understanding of this method, an introduction to the sequential B&B applied to the Set Partitioning Problem follows.

4.1. Sequential B&B applied to the Set Partitioning Problem

Given $n$ variables $x_1, \ldots, x_n$ with corresponding costs $c_1, \ldots, c_n$ and 0-1 coefficients $a_{1j}, \ldots, a_{nj}$, for $j = 1, \ldots, m$, the Set Partitioning Problem (SPP) is the problem of assigning 0-1 values to these variables such that $\sum_{i=1}^{n} a_{ij} x_i = 1$ for $j = 1, \ldots, m$, minimizing $\sum_{i=1}^{n} c_i x_i$. Besides the many applications of this problem, the SPP is of great interest because it is a natural special case of integer programming.

4.1.1. Lower Bound A straightforward lower bound on the optimal solution for this problem can be calculated by solving its continuous relaxation
where the $\pi_j$ variables may assume either positive or negative values.

The advantage of using (4)-(5) instead of (1)-(3) is that optimality is not necessary: any feasible dual solution provides a valid lower bound. In our branch-and-bound procedure, we use the following heuristic to calculate a feasible dual solution that approaches the optimal one in a reduced computational time.
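The formulations cited as (1)-(3) and (4)-(5) are not reproduced in this part of the text. Assuming they take the standard form for the SPP defined in Section 4.1, the continuous relaxation and its dual would read as below; the equation numbering is inferred from how the constraints are cited (branching on constraints (2), dual variables $\pi_j$ unrestricted in sign) and should be treated as an assumption.

```latex
\begin{align}
\min\ & \sum_{i=1}^{n} c_i x_i \tag{1}\\
\text{s.t.}\ & \sum_{i=1}^{n} a_{ij} x_i = 1, \quad j = 1, \ldots, m \tag{2}\\
& x_i \ge 0, \quad i = 1, \ldots, n \tag{3}
\end{align}

\begin{align}
\max\ & \sum_{j=1}^{m} \pi_j \tag{4}\\
\text{s.t.}\ & \sum_{j=1}^{m} a_{ij} \pi_j \le c_i, \quad i = 1, \ldots, n \tag{5}
\end{align}
```

By weak duality, the objective value of any $\pi$ satisfying (5) is a lower bound on the optimum of (1)-(3), and therefore on the optimal SPP solution.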

Our dual heuristic repeats two main steps for a fixed number of iterations. The first step, which we call the forward step, consists of increasing the $\pi_j$ values as much as possible. Then, in the backward step, it reduces some $\pi_j$ values while increasing others, aiming to improve the lower bound in the next forward step. Hence, the backward step is not executed in the last iteration.

The forward step is also divided into a number of iterations. In each iteration, the same value $\Delta_1$ is added to each $\pi_j$ that does not belong to a saturated constraint, i.e., $\Delta_1$ is added to $\pi_j$ if and only if $\sum_{j'=1}^{m} a_{ij'} \pi_{j'} < c_i$ for all $i$ such that $a_{ij} = 1$. Since $\Delta_1$ is chosen as the maximum value that keeps all constraints (5) satisfied, at least one new constraint becomes saturated in every iteration. The forward step stops when no more $\pi_j$ variables can be increased. This step is part of a well-known approximation algorithm for the SPP [54].
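A minimal sketch of the forward step as described above follows. It is an illustration, not the authors' code: the function name, the dense 0-1 matrix `a[i][j]` (variable `i`, row `j`) and the use of exact fractions to detect saturation are all assumptions, and the starting duals are assumed to be feasible.

```python
from fractions import Fraction


def forward_step(a, c, pi):
    """Raise the duals pi_j until every row touches a saturated constraint.

    a[i][j] is 1 if variable i appears in row j (0 otherwise), c[i] is the
    cost of variable i, and pi[j] is the current (feasible) dual of row j.
    """
    n, m = len(a), len(pi)
    c = [Fraction(x) for x in c]
    pi = [Fraction(x) for x in pi]
    while True:
        # Slack of each dual constraint: c_i - sum_j a_ij * pi_j.
        slack = [c[i] - sum(a[i][j] * pi[j] for j in range(m)) for i in range(n)]
        saturated = [s == 0 for s in slack]
        # A row stays active while no saturated constraint contains it.
        active = [j for j in range(m)
                  if not any(a[i][j] and saturated[i] for i in range(n))]
        if not active:
            return pi
        # Largest common increase that keeps every constraint satisfied.
        steps = [slack[i] / sum(a[i][j] for j in active)
                 for i in range(n) if any(a[i][j] for j in active)]
        if not steps:
            raise ValueError("an active row is covered by no variable")
        delta = min(steps)
        for j in active:
            pi[j] += delta
```

Each pass saturates at least the constraint that attains the minimum, so the loop ends after at most as many passes as there are variables. The sum of the returned duals is the lower bound given by this step.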

In the backward step, the value of $\pi_j$ is decreased by $\Delta_2(\alpha_j - 1)$, for some $\Delta_2$, where $\alpha_j$ is the number of saturated constraints where $\pi_j$ has a non-zero coefficient. If $\alpha_j = 0$, then $\pi_j$ is increased by $\Delta_2$. The value of $\Delta_2$ is chosen so that the current lower bound is multiplied by a given factor $\theta$. We use $\theta = 0.5$ in the first iteration of the root node and $\theta = 0.3$ in the first iteration of the remaining nodes. After each iteration, $\theta$ is multiplied by 0.7. We perform 10 iterations in the root node and 5 in the remaining nodes.

4.1.2. Branching We do branching on the constraints (2). For a selected row $j$, we create one branch for each $i$ with $a_{ij} = 1$, where the variable $x_i$ is fixed to one.

One important characteristic of the SPP is that each child node can be substantially smaller than its parent. Whenever a variable $x_i$ is fixed to one, every variable $x_{i'}$ such that both $a_{ij} = 1$ and $a_{i'j} = 1$ for some $j$ can be fixed to zero. Then, every constraint (2) where $x_i$ has a non-zero coefficient can be removed. In our method, the remaining constraints inherit the values of $\pi_j$ from the parent node.

Next, we describe the criterion used to select a constraint $j$ for branching. Let $\delta_i$ be the number of constraints $l$ such that $a_{il} = 0$. We select the constraint $j$ with the smallest value of $\sum_{i \in \{i \,:\, a_{ij} = 1\}} \delta_i$, which represents the total number of constraints in all child nodes that would be created.
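The selection criterion can be sketched as follows, assuming the node is stored as a dense 0-1 matrix `a[i][j]` with one row of the list per variable and one column per constraint. The function name and the tie-breaking rule (smallest constraint index) are assumptions made for this illustration.

```python
def select_branching_row(a):
    """Return (row, score) for the constraint with the fewest child constraints.

    a[i][j] is 1 if variable i appears in constraint j, 0 otherwise.
    """
    n, m = len(a), len(a[0])
    # delta[i]: constraints that remain in the child where x_i is fixed to one.
    delta = [sum(1 for l in range(m) if a[i][l] == 0) for i in range(n)]
    best_row, best_score = None, None
    for j in range(m):
        # Total number of constraints over all children created from row j.
        score = sum(delta[i] for i in range(n) if a[i][j] == 1)
        if best_score is None or score < best_score:
            best_row, best_score = j, score
    return best_row, best_score
```

With four variables covering rows {0}, {0, 1}, {1, 2} and {2}, the values of `delta` are 2, 1, 1 and 2, so the middle row (score 2) is chosen over the other two (score 3 each).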

In order to find feasible solutions earlier, we process the child nodes in non-decreasing order of $(c_i - \sum_{j=1}^{m} a_{ij} \pi_j) / \delta_i$. The branch-and-bound tree is traversed in a depth-first search fashion.

A more sophisticated and effective dual heuristic for the set partitioning problem has been proposed recently in [55]. However, we decided to use our own heuristic because it is simpler and achieves comparable lower bounds for the instances used in our experiments.

4.2. Parallel Branch-and-Bound applied to the Set Partitioning Problem - PBB_SPP

The parallel algorithm is based on the previously described Branch-and-Bound algorithm for the Set Partitioning Problem. The PBB_SPP has interesting characteristics in relation to memory management. First, it does not generate a binary tree; in fact, the number of subtrees generated by each node can vary considerably. Also, node execution times are usually very small, on average between 0.001 and 0.006 seconds, depending on the instance. However, many of these nodes can need a larger amount of memory (this necessary amount is referred to as the node size).

Table III presents information about node sizes in bytes and their corresponding execution times in seconds. For four different instances, it shows the five smallest node sizes (first five lines of each instance) and the five largest node sizes (the five remaining lines of each instance). The table also presents the associated levels (distance from the root) of those nodes in the B&B tree. The instances used in the tests were randomly generated. The first two numbers of the instance name refer to the quantity of items and sets, respectively. The remaining information refers to the probability that items appear in the set, followed by the seed of randomness.

It can be observed that all node execution times are very small. It is also important to note that the lowest level nodes demand much more memory than the highest level ones, since the quantity of saturated constraints is smaller in the lowest level nodes than in the highest level ones.



Table III. Example reporting the level, size and execution time of nodes for four instances of SPP

Level  Node Size (Bytes)  Node Execution Time (s)    Level  Node Size (Bytes)  Node Execution Time (s)

190-400-0.03                                          1100-500-0.03
17     448                0.001                       25     420                0.001
17     452                0.001                       22     424                0.105
18     456                0.085                       22     448                0.001
18     480                0.001                       26     472                0.001
16     516                0.001                       27     480                0.005
2      9092               0.005                       1      9836               0.096
3      9208               0.005                       2      10068              0.006
1      9216               0.076                       1      10452              0.001
2      9344               0.005                       2      10888              0.006
1      9436               0.062                       1      11780              [illegible]

1110-750-0.03                                         1200-650-0.02-100
24     508                0.001                       25     2608               0.001
24     516                0.012                       19     2640               0.001
29     528                0.001                       23     2724               0.000
29     532                0.034                       22     2728               0.001
24     532                0.026                       23     2828               0.002
2      17072              0.009                       1      16960              0.003
1      17476              0.006                       1      17080              0.009
1      17628              0.011                       2      17400              0.011
1      18492              0.003                       1      18616              0.003
1      18608              0.019                       1      19004              0.008



Figure 5 presents the execution time versus node size for the instance 190-400-0.03. Most of the nodes require very little computation time; however, their sizes vary considerably, from 50 Bytes to 9.6 KBytes. All other instances analyzed in this work presented similar characteristics.

Figure 5. Execution time and node size for the 190-400-0.03 instance of the Set Partitioning Problem



4.2.1. The Load Balance Framework The PBB_SPP algorithm assumes a static assignment of processes to machines such that exactly one process is assigned to each physical machine $M_i$ of the cluster. A process is composed of as many threads as the number of cores of machine $M_i$, including a manager thread, $MT_i$, which is responsible for generating the remaining threads in $M_i$, called workers, and for communicating with other machines of the cluster. At each core $C_{(i,j,k)}$, a worker thread denoted as $T_{(i,j,k)}$ executes B&B tree nodes until it becomes idle, at which point it initiates a procedure to obtain new subtrees from overloaded worker threads. A unique leader thread (one leader per application), created on the machine $M_0$ and denoted as $LT$, is responsible for starting and terminating the application.

When a worker thread $T_{(i,j,k)}$ receives a node, it executes a branch-and-bound procedure which generates other nodes that are kept in a local list of nodes $TL_{(i,j,k)}$. In accordance with a B&B parameter, each subtree can be traversed either in breadth-first or depth-first order, which in turn can affect the size of the list $TL_{(i,j,k)}$. In both traversal schemes, the proposed load balance strategy respects the associated cache size in accordance with Condition 1 of the Load Balance Model stated in Section 3.4.

The manager thread $MT_i$ is responsible for requesting load from another machine in the system. Let $M_j$ be a machine with overloaded threads. $MT_j$ removes part of the nodes from the lists of all its threads and sends them to $MT_i$, which requested load. If $MT_i$ is not able to obtain more load and all its threads are idle, it reaches its local termination condition and informs the leader of the application, $LT$. The PBB_SPP terminates when $LT$ receives the local termination condition from all manager threads in the system.

Figure 6 shows an example of two machines $M_0$ and $M_1$, each one with a processor, $P_{(0,0)}$ and $P_{(1,0)}$, respectively. Each processor has two cores, $C_{(i,j,0)}$ and $C_{(i,j,1)}$, that share a common cache. The procedures executed by threads are represented by rectangles. Additionally, the figure shows the global lists, $ML_0$ and $ML_1$, used in the inter-machine load balancing, and the local lists, $TL_{(0,0,0)}$, $TL_{(0,0,1)}$, $TL_{(1,0,0)}$ and $TL_{(1,0,1)}$, used in the load balancing among worker threads of the same machine.

Figure 6. The Load Balance Framework on the target architecture



4.3. Load Balance Algorithms 

The initial load distribution is performed by $LT$, which executes the root node of the parallel B&B tree. As shown in Algorithm 1, the generated nodes are placed in the list $GL$ (line 1). Considering that the function numberNodes($GL$) returns the quantity of nodes in this list, $LT$ evenly shares the nodes among the worker threads (line 2) by sending Load messages (lines 2-10).

If a thread $T_{(i,j,k)}$ does not receive any initial load (i.e., when numberNodes($GL$) $< m * p * c$) or finishes executing its current load, it starts a load request procedure by executing Algorithm 2, which is actually performed whenever $T_{(i,j,k)}$ becomes idle.

Algorithm 1 Initial Distribution managed by LT

 1: GL ← Solve(RootNode);
 2: numNodes ← numberNodes(GL) / (m * p * c);
 3: for all i ← 0 to m − 1 do
 4:   for all j ← 0 to p − 1 do
 5:     for all k ← 0 to c − 1 do
 6:       Load ← nodes(GL, numNodes);
 7:       Send Load to T(i,j,k);
 8:     end for
 9:   end for
10: end for

Upon finishing the execution of the nodes of $TL_{(i,j,k)}$, the thread $T_{(i,j,k)}$ starts the load balance procedure in order to obtain nodes from other overloaded threads, if any exist, in the following order: first, from a neighbor core (Algorithm 2); second, from other threads of the same machine, when the neighbor threads are underloaded or even idle, i.e., their respective node lists are empty (Algorithm 3, lines 6-9); and finally, from another machine, if the thread $T_{(i,j,k)}$ is not able to obtain load from other threads in its own machine $M_i$ (Algorithm 3, line 11).
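The search order described above (neighbor cores, then the other processors of the same machine, then the manager thread) can be sketched as a generator. This is an interpretation of the prose rather than a transcription of Algorithm 3: the function name and the tuple encoding of targets are hypothetical, and the sketch visits every other core of the machine before escalating.

```python
def request_order(j, k, p, c):
    """Yield the targets a worker on processor j, core k asks for load, in order.

    p is the number of processors per machine and c the number of cores per
    processor; targets are ("worker", processor, core) or ("manager",).
    """
    # 1. Neighbor cores: same processor, hence same shared cache.
    for y in range(c):
        if y != k:
            yield ("worker", j, y)
    # 2. Cores of the other processors of the same machine.
    for x in range(p):
        if x != j:
            for y in range(c):
                yield ("worker", x, y)
    # 3. Nobody local has load: forward the request to the manager thread.
    yield ("manager",)
```

A worker would stop consuming the generator as soon as one target answers with load, so the more expensive targets are contacted only when the cheaper ones are empty.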

Upon receiving a load request message, $T_{(i,j,k)}$ sends a number of nodes from its local list, as presented in line 2 of Algorithm 4, or sends a message informing that its list is also empty, as shown in line 6 of the same algorithm.

Concerning the manager thread, when it receives a number $NT$ of local load requests, it sends a request to another machine, following a sequence of machines to be requested, as presented in Algorithm 5. When the other machine answers the request by sending load, the manager thread shares the received load among the requesting threads, as depicted in Algorithm 6. If, at last, it is not able to obtain load from any other machine, it initiates the termination detection algorithm (Algorithm 7).

Finally, when a manager thread receives a load request from another machine, as depicted in Algorithm 8, it tries to obtain load from all worker threads of its own machine. Upon receiving an answer from all worker threads, it forwards the total obtained load to the requesting machine by executing Algorithm 9.



4.4. Implementation issues of the Load Balance Framework 

The proposed model MCM influenced the implementation of the PBB_SPP in many aspects, as described next.

In order to avoid cache memory contention, as verified in the previous study, the list of nodes $TL_{(i,j,k)}$ managed by each thread $T_{(i,j,k)}$ during the PBB_SPP execution should not occupy more than its share, which is the total size of the L2 cache memory divided by the number of threads that share it. Once this limit is reached, a recursive procedure that performs a depth-first traversal of the B&B tree is initiated. Depth-first traversal can considerably improve the B&B performance.
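The cache-share rule amounts to a simple bound check. The sketch below uses hypothetical names and, as an example value, a 6 MB L2 cache shared by two threads, which gives the 3 MB share mentioned later for the experiments; the 6 MB figure is an assumption made only to reproduce that share.

```python
MB = 1024 * 1024


def cache_share(l2_bytes: int, sharing_threads: int) -> int:
    """Maximum size of a thread's local node list: L2 size / sharing threads."""
    return l2_bytes // sharing_threads


def next_traversal(list_bytes: int, share_bytes: int) -> str:
    """Switch to depth-first traversal once the local list fills its share."""
    return "depth" if list_bytes >= share_bytes else "breadth"


share = cache_share(6 * MB, 2)  # assumed example: 6 MB L2, two neighbor cores
```

With these values a thread would keep expanding nodes breadth-first while its list is below 3 MB and would fall back to the recursive depth-first procedure afterwards.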

In order to have useful data in the last level cache when needed, a local list of nodes $TL_{(i,j,k)}$ was implemented for each thread, increasing the chances of processing them without accessing the main memory. However, so-called false sharing might occur when threads on different cores write to the same cache line, but not at the same location. In this case, since the written locations are different, there is no real coherency problem, but the cache-coherency protocol marks the cache line as dirty and, when there is an access request to the other location, the hardware logic forces a reload of the updated cache line from memory (even if not really necessary in logical terms). Frequent updates of the data in the shared cache line could cause severe performance degradation. In order to prevent this degradation, each list of nodes was allocated in a different cache line.

Concerning the load balance procedure executed by $MT_i$, a global list $ML_i$ of nodes is also created at each machine $M_i$. The manager thread $MT_i$ stores the nodes transferred from other machines in $ML_i$ and distributes this total load among the threads in $M_i$ that requested load. The updates of the global list, both storing nodes in and removing nodes from $ML_i$, were implemented following the rules of the classical producer-consumer problem, guaranteeing that data were consistent and that no deadlock occurred.
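A producer-consumer discipline for the global list can be sketched with a condition variable. The class and method names below are hypothetical, and Python threads stand in for the manager and worker threads; the source does not state which synchronization primitives were actually used.

```python
import threading
from collections import deque


class GlobalNodeList:
    """Node list shared by one producer (manager) and several consumers (workers)."""

    def __init__(self):
        self._nodes = deque()
        self._cond = threading.Condition()

    def store(self, nodes):
        """Producer side: append nodes received from another machine."""
        with self._cond:
            self._nodes.extend(nodes)
            self._cond.notify_all()

    def remove(self, count, timeout=None):
        """Consumer side: wait for load, then take up to `count` nodes."""
        with self._cond:
            # Blocks until the list is non-empty or the timeout expires.
            self._cond.wait_for(lambda: len(self._nodes) > 0, timeout=timeout)
            taken = []
            while self._nodes and len(taken) < count:
                taken.append(self._nodes.popleft())
            return taken
```

Both operations hold the same lock, so the list is never observed half-updated, and a consumer that waits releases the lock while blocked, which is what rules out a deadlock between storing and removing.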



Considering that load transfer inside a machine involves only two threads, a temporary list of nodes is created with half of the nodes of the thread that holds load; those nodes are then removed by the load requesting thread.
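A possible shape for this transfer is shown below. It hands over at most half of the donor's nodes and, as an additional assumption taken from the cache-share rule of the model, stops once the receiver's cache share would be exceeded. The function name, the choice of taking nodes from the tail of the list, and the `size_of` callback are hypothetical.

```python
def split_load(nodes, size_of, cache_bytes):
    """Remove and return up to half of `nodes` that fit in `cache_bytes`.

    nodes: the donor's local list (modified in place); size_of: function
    returning the size in bytes of a node; cache_bytes: receiver's share.
    """
    limit = len(nodes) // 2
    given, used = [], 0
    while len(given) < limit and used + size_of(nodes[-1]) <= cache_bytes:
        node = nodes.pop()
        used += size_of(node)
        given.append(node)
    return given
```

The donor always keeps at least half of its nodes, so a single request cannot leave it idle.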

Note that although the global list $ML_i$ can be larger than the available cache space, it is used only when the internal load balance fails. As the next section shows, this happens only occasionally when compared with the internal load transfers. Note also that the communication time among machines is much higher than a node processing time, and transmitting very small loads can increase the frequency of communication in the network. In this case, the performance could be negatively affected.



Algorithm 2 Load Request by T(i,j,k) when it becomes idle

 1: if TL(i,j,k) = ∅ then
 2:   Send LoadRequest to T(i,j,l);
 3: end if

Algorithm 3 When T(i,j,k) receives NoLoad from T(i,x,y)

 1: if (x = j) then
 2:   if (y + 1 < c − 1) then
 3:     y++;
 4:     Send LoadRequest to T(i,j,y); {Send request to another core in the same processor}
 5:   else
 6:     if (j + 1 < p − 1) then
 7:       j++;
 8:       y ← 0;
 9:       Send LoadRequest to T(i,j,y); {Send request to another processor on the same machine}
10:     else
11:       Send LoadRequest to MT_i; {otherwise, forward request to manager thread}
12:     end if
13:   end if
14: end if

Algorithm 4 When T(i,j,k) receives LoadRequest from T(i,j,l)

 1: if (TL(i,j,k) ≠ ∅) then
 2:   numNodes ← min(numberNodes(TL(i,j,k)) / 2, number of nodes of TL(i,j,k) that fit in cm(i,j,l));
 3:   Load ← nodes(TL(i,j,k), numNodes);
 4:   Send Load to T(i,j,l);
 5: else
 6:   Send NoLoad to T(i,j,l);
 7: end if

Figure 7. Algorithms executed by worker threads.



Algorithm 5 When MT_i receives LoadRequest from T(i,j,k)

 1: if (totalIdle = NT) and (x + 1 < m) then
 2:   x++;
 3:   Send LoadRequest to MT_x;
 4: end if
 5: totalIdle++;

Algorithm 6 When MT_i receives GlobalLoadRequest from MT_x

 1: ML_i ← GlobalLoadRequest;
 2: numNodes ← numberNodes(GlobalLoadRequest) / totalIdle;
 3: for all (j ← 0 to j < p) do
 4:   for all (k ← 0 to k < c) do
 5:     Load ← nodes(ML_i, numNodes);
 6:     Send Load to T(i,j,k);
 7:   end for
 8: end for

Algorithm 7 When MT_i receives NoLoad from MT_x

 1: if (x + 1 < m − 1) then
 2:   x++;
 3:   Send LoadRequest to MT_x;
 4: else
 5:   terminationDetection();
 6: end if

Algorithm 8 When MT_i receives LoadRequest from MT_x

 1: numLoad ← 0;
 2: for all (j ← 0 to j < p) do
 3:   for all (k ← 0 to k < c) do
 4:     Send LoadRequest to T(i,j,k);
 5:   end for
 6: end for

Algorithm 9 When MT_i receives Load from T(i,j,k)

 1: numLoad++;
 2: GlobalLoadRequest ← GlobalLoadRequest + Load;
 3: if (numLoad = p * c) then
 4:   if (GlobalLoadRequest ≠ ∅) then
 5:     Send GlobalLoadRequest to MT_x;
 6:   else
 7:     Send NoLoad to MT_x;
 8:   end if
 9: end if

Figure 8. Algorithms executed by the manager thread.



5. EXPERIMENTAL RESULTS 



The experiments presented in this section were executed on two clusters: Cluster Rio, which was described in Section 3, and Cluster Oscar, described next. Each machine of Oscar has two quad-core processors (Intel Xeon 5355 Clovertown). Each core has one private L1 cache (64 KB) and shares one L2 cache (8MB) with another core on the same processor. All cores of the same machine have uniform access to a 16GB main memory module. The operating system is CentOS 5.3 with kernel 2.6.18.

5.1. Analyzing Memory Allocation in PBB_SPP

As seen in the previous section, each node of the B&B tree can produce other ones and, for evaluation purposes, both depth and breadth tree traversals were tested in this work.
When breadth traversal was used, the generated nodes, kept in the list $TL_{(i,j,k)}$, occupied more memory than the available space in L2 cache memory. Although keeping many generated nodes in $TL_{(i,j,k)}$ guarantees that there will be load to share with cores that eventually become idle, it can cause cache access contention. In order to verify that the proposed model MCM can be successfully applied to a real application, we executed the parallel B&B several times, varying the maximum size of $TL_{(i,j,k)}$. Note that, following the model, each thread should not use more than the total cache size divided by the number of cores that share it. In our environment, this means that each of the two threads allocated to neighbor cores should use up to 3 MB of L2 cache. The PBB_SPP was executed with the following maximum sizes of $TL_{(i,j,k)}$: 1, 3, 6 and 8MB. Tests were performed on one machine of Cluster Rio. Note that although a breadth traversal procedure is being used, when the size limit of $TL_{(i,j,k)}$ is reached, the algorithm starts the execution of a recursive depth traversal procedure. Results are presented in Table IV, where columns $|TL_{(i,j,k)}|$, # Nodes, Wall Clock Time and % CM are, respectively, the maximum size of $TL_{(i,j,k)}$, the number of nodes solved in the corresponding B&B tree, the wall clock time in seconds of the PBB_SPP, and the average percentage of cache misses per thread. Note that these results are averages of ten executions, and in all cases the standard deviation was negligible.

The wall-clock times show that as the $TL_{(i,j,k)}$ size increases, the execution time also increases, even though a similar number of nodes is executed. In particular, an abrupt growth in time occurs when the total $TL_{(i,j,k)}$ size exceeds the L2 cache size, confirming the ability of the proposed model MCM to represent memory contention. Moreover, the cache miss percentage also increases with the $TL_{(i,j,k)}$ size.

5.2. Evaluating the Load Balance Framework 

In order to evaluate the efficiency of the proposed Load Balance Framework, PBB_SPP was executed both in accordance with the proposed framework and without any load balance procedure. Tests were executed on two machines of the Oscar cluster, each running eight threads, one per core. In the version without load balance, only the Initial Distribution procedure in Algorithm 1 is executed, and when a thread finishes its nodes, it stays idle until all threads finish their executions and the application terminates. In order to evaluate the quality of the load distribution proposed in PBB_SPP, the following unbalance factor was calculated from the generated results: $Un\_Factor = 1 - \frac{T_{Med}}{T_{Max}}$ [56], where $T_{Med}$ is the average of the execution times of all the threads and $T_{Max}$ is the longest execution time among all of them.
Table V presents, for both versions (with the load balance framework and without load balance) and for each instance, the average total time of ten executions in seconds (Total Time), the average number of processed nodes in the corresponding B&B tree (# Nodes), the unbalance factor (Un_Factor), the coefficient of variation of the execution times (CV), and the obtained speedup (Speedup) and efficiency (E).
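The unbalance factor and the efficiency can be computed as sketched below. The function names are hypothetical, and the efficiency is taken as the usual ratio of speedup to number of cores, an assumption that is consistent with the reported values if 16 cores (two machines with eight cores each) are considered.

```python
from statistics import mean


def unbalance_factor(thread_times):
    """1 - T_Med / T_Max: 0 when all threads finish together, near 1 when one dominates."""
    return 1 - mean(thread_times) / max(thread_times)


def efficiency(speedup, cores):
    """Speedup divided by the number of cores used (assumed definition)."""
    return speedup / cores
```

For example, two threads that run for 5 s and 10 s give an unbalance factor of 0.25, and a speedup of 3.43 on 16 cores gives an efficiency of about 0.21.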

As can be seen in Table V, executing PBB_SPP under the proposed load balance framework more than doubled the efficiency in every instance for which both versions completed, even when processing a similar number of nodes in most cases. It can also be noted that the unbalance factor was almost zero for all instances, indicating that the proposed PBB_SPP can really improve the application performance.



Table IV. Analysis of the number of B&B tree nodes, the wall clock time and the number of cache misses when breadth traversal is carried out

Instances            |TL(i,j,k)|   # Nodes       Wall Clock Time (s)   % CM
190-400-0.03         1MB           17067         10.55                 28.1777
                     3MB           17067         11.49                 [illegible]
                     6MB           17067         11.58                 29.3848
                     9MB           17067         12.09                 29.4519
190-400-0.04         1MB           107205        41.09                 28.9852
                     3MB           107205        44.24                 [illegible]
                     6MB           107205        44.29                 31.0947
                     9MB           107205        46.67                 31.3874
190-400-0.05         1MB           279272        89.55                 29.1691
                     3MB           [illegible]   [illegible]           [illegible]
                     6MB           279272        101.12                31.2821
                     9MB           279272        104.65                31.3414
1100-500-0.03        1MB           28641         20.50                 26.3336
                     3MB           28641         [illegible]           [illegible]
                     6MB           28641         22.45                 27.2445
                     9MB           28641         24.04                 27.3580
1100-500-0.04        1MB           409252        201.60                27.7242
                     3MB           [illegible]   [illegible]           [illegible]
                     6MB           409252        222.93                28.5550
                     9MB           409252        225.28                29.3350
1100-500-0.05        1MB           1999934       638.46                29.5558
                     3MB           1999934       [illegible]           [illegible]
                     6MB           1999934       648.46                29.9513
                     9MB           1999934       679.46                29.9788
1110-750-0.03        1MB           20439643      15704.29              30.7038
                     3MB           20439643      [illegible]           [illegible]
                     6MB           20439643      16789.79              31.4379
                     9MB           20439643      30764.23              32.3299
1200-650-0.02-100    1MB           12919402      18197.06              32.4007
                     3MB           12919402      34126.15              32.5090
                     6MB           12919402      35248.22              35.0640
                     9MB           12919402      63456.87              35.1265
1200-650-0.02-152    1MB           24294476      29855.11              33.1098
                     3MB           24294476      34644.88              33.6970
                     6MB           24294476      113764.58             33.1544
                     9MB           24294476      270142.70             33.8493



No results are reported without load balance for the instances 1110-750-0.04, 1110-750-0.05 and 1200-600-0.04, since they ran for more than three days and their executions were halted due to lack of available memory. This happened because of the poor initial load division.

To measure the overhead of the proposed PBB_SPP, distinct phases of the load balance framework were evaluated. The numbers of load requests sent inside a machine and transmitted to a different machine are shown in Table VI. As presented in columns # Local_Req and # Global_Req, the number of messages exchanged inside a machine is much higher than the number exchanged among different machines. Nonetheless, transmitting the local messages contributes much less to the total execution time than the messages sent via network.

5.3. Scalability Experiments 

The last experiment aims to verify the scalability of the PBB_SPP by increasing the number of machines available to execute each instance. Initially, only two machines were considered in order to measure the size of the messages exchanged between them. This was carried out to evaluate their impact on the application performance since, as seen in Section 3.2, MCM indicated that long messages sent via network might reduce performance. As shown in Table VII, the messages were never longer than 4 MBytes, where Largest indicates the size of the largest message when running the respective application instance, Smallest the size of the smallest message, and Average the average size of all messages. Secondly, four and eight machines were also considered and, consequently, more threads worked in parallel. As shown in Tables VIII and IX, even



Table V. Comparison between the PBB_SPP load balance mechanism and a parallel B&B without load balancing for the same problem

Instances            Total Time (s)   # Nodes         Un_Factor   CV     Speedup   E

Without Load Balance
190-400-0.03         13.66            31481.44        0.6370      0.36   0.68      0.04
190-400-0.04         28.31            117999.67       0.5533      0.12   0.74      0.05
190-400-0.05         43.36            232304.80       0.7525      0.10   0.46      0.03
1100-500-0.03        29.66            112652.00       0.3200      0.24   0.67      0.04
1100-500-0.04        168.34           803151.34       0.5564      0.22   0.34      0.02
1100-500-0.05        723.17           1861401.20      0.7618      0.00   0.97      0.06
1110-750-0.03        6279.81          37533812.60     0.6317      0.29   0.37      0.02
1110-750-0.04        -                -               -           -      -         -
1110-750-0.05        -                -               -           -      -         -
1200-650-0.02-100    15278.05         13032890.60     0.8010      0.25   0.86      0.05
1200-650-0.02-152    36100.79         24354320.17     0.8570      0.03   1.15      0.07
1200-600-0.04        -                -               -           -      -         -

With Load Balance PBB_SPP
190-400-0.03         5.89             26028.11        0.0161      0.20   3.43      0.21
190-400-0.04         17.39            115415.60       0.0074      0.11   2.21      0.14
190-400-0.05         37.78            280954.67       0.0043      0.02   2.48      0.15
1100-500-0.03        23.68            102316.00       0.0075      0.18   1.88      0.12
1100-500-0.04        119.14           804957.80       0.0019      0.18   4.11      0.26
1100-500-0.05        267.14           2075008.78      0.0007      0.04   2.78      0.17
1110-750-0.03        5893.70          29515686.90     0.0000      0.22   2.90      0.18
1110-750-0.04        39343.88         143389240.00    0.0000      0.05   1.18      0.07
1110-750-0.05        20018.02         106427367.00    0.0074      0.03   1.92      0.12
1200-650-0.02-100    5690.85          13006009.90     0.0001      0.13   3.10      0.19
1200-650-0.02-152    9786.27          24337676.10     0.0001      0.18   3.22      0.20
1200-600-0.04        30200.41         132296456.50    0.0000      0.01   1.92      0.12



Table VI. Information on the communication

                     Local                     Global
Instances            % Time   # Local_Req     % Time   # Global_Req
190-400-0.03         3.155    50.854          12.456   5.500
190-400-0.04         1.594    125.113         8.226    10.450
190-400-0.05         0.276    78.597          0.845    5.056
1100-500-0.03        2.185    108.000         7.149    9.600
1100-500-0.04        0.670    231.631         2.288    13.700
1100-500-0.05        0.154    249.896         0.630    10.684
1110-750-0.03        0.161    1381.688        0.448    36.450
1110-750-0.04        0.066    1149.050        0.131    33.200
1110-750-0.05        0.053    657.464         0.139    7.699
1200-650-0.02-100    0.048    647.656         0.114    19.931
1200-650-0.02-152    0.046    906.531         0.108    26.450
1200-600-0.04        0.050    753.875         0.095    22.000



with the growing number of messages transmitted over the network, PBBspp still improved
performance.

Note that message sizes never exceeded 4 MB; therefore, priority was given to condition 2.a of the
Load Balance Model in Section 4 rather than to condition 2.b. However, because of the number of
free B&B nodes created, PBBspp allocated more machines once the caches of the current machines
became saturated.
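
The allocation behaviour described above can be illustrated with a minimal sketch. It is not the
PBBspp implementation and does not reproduce the conditions of the Load Balance Model; it only
assumes that each machine offers a fixed cache budget for B&B nodes and that a new machine is
opened when no current machine can hold the next node. The names Machine, place_nodes, cache_kb
and max_machines are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    cache_kb: float                      # assumed cache budget available for B&B nodes
    used_kb: float = 0.0                 # cache already occupied by assigned nodes
    nodes: list = field(default_factory=list)

    def fits(self, size_kb):
        """True if a node of this size still fits in the remaining cache budget."""
        return self.used_kb + size_kb <= self.cache_kb


def place_nodes(node_sizes_kb, cache_kb, max_machines):
    """Greedy placement: fill current machines first, open a new one on saturation."""
    machines = [Machine(cache_kb)]
    for node_id, size in enumerate(node_sizes_kb):
        # First machine whose cache can still hold the node, if any.
        target = next((m for m in machines if m.fits(size)), None)
        if target is None and len(machines) < max_machines:
            # All current caches are saturated: allocate one more machine.
            target = Machine(cache_kb)
            machines.append(target)
        if target is None:
            # No machine left to allocate: fall back to the least loaded one.
            target = min(machines, key=lambda m: m.used_kb)
        target.nodes.append(node_id)
        target.used_kb += size
    return machines


if __name__ == "__main__":
    # Hypothetical node sizes (KB) and cache budget, chosen only for illustration.
    for i, m in enumerate(place_nodes([300, 300, 300, 300], cache_kb=700, max_machines=4)):
        print(f"machine {i}: nodes {m.nodes}, {m.used_kb:.0f} KB used")
```

With these illustrative inputs, the third node no longer fits in the first machine, so a second
machine is opened; a real scheduler would also have to account for communication cost.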



6. CONCLUSIONS AND FUTURE WORK 

This paper proposes the MCM model, which represents the most relevant characteristics of a
multicore cluster, based on the results of exhaustive experiments with a synthetic application. To
validate the model, it was used in the design and development of a Parallel Branch-and-Bound for
the Set Partitioning Problem. Under the MCM, a load balance framework for solving this problem
prevents memory contention from directly affecting performance by scheduling the nodes of the B&B tree



Table VII. Message sizes (KB)

Instances             Largest   Smallest   Average
190-400-0.03           591.73       8.50    336.28
190-400-0.04           480.80      29.10    310.38
190-400-0.05           174.81       3.32     75.38
1100-500-0.03          770.92      16.48    403.37
1100-500-0.04          982.40       8.28    524.22
1100-500-0.05          810.58       4.43    351.16
1110-750-0.03         3468.72      11.08   1903.83
1110-750-0.04         3076.16      20.02    736.14
1110-750-0.05          215.04      14.76    839.86
1200-650-0.02-100     1730.42      86.31    894.08
1200-650-0.02-152     1279.84      30.08    748.10
1200-600-0.04         1395.79       4.92    736.14



Table VIII. PBBspp execution on four machines

Instances               Time    # Nodes  % Time Local  # Local_Req  % Time Global  # Global_Req  Un_Factor  Speedup
190-400-0.03            4.49      25383        5.2656      50.1250        35.4018       11.2500     0.0339    4.505
190-400-0.04           10.44     115431        1.9193     115.3438        15.3741       18.0000     0.0084    3.687
190-400-0.05           19.66     282458        1.0993     162.5000         7.8492       19.0000     0.0073    4.765
1100-500-0.03          11.83      90535        3.1212      90.7188        18.2631       15.5000     0.0131    3.757
1100-500-0.04          94.38    1166225        0.8258     291.9375         5.6362       26.5000     0.0022    5.184
1100-500-0.05         148.38    2247700        0.9256     401.2188         2.4457       24.5000     0.0041   49.990
1110-750-0.03        2345.01   20947396        0.3405    1492.2813         1.2305       66.0000     0.0000    7.291
1110-750-0.04       11338.25  127309716        0.1154    1368.5000         0.3227       47.0000     0.0000    4.106
1110-750-0.05        7349.01  111828773        0.0985     932.8438         0.4027       29.2500     0.0001    5.217
1200-650-0.02-100    2583.50   12962936        0.0792     675.1875         0.2889       29.7742     0.0001    6.839
1200-650-0.02-152    4692.99   24327659        0.0623     931.4688         0.2137       37.2500     0.0000    6.706
1200-600-0.04        9834.11  130502067        0.1349    1420.6875         0.3329       44.5000     0.0001    5.907



Table IX. PBBspp execution on eight machines

Instances               Time    # Nodes  % Time Local  # Local_Req  % Time Global  # Global_Req  Un_Factor  Speedup
190-400-0.03            3.70      39845         0.204       56.609          1.726        15.938      0.035    5.463
190-400-0.04            7.34     119475         0.249      109.953          2.613        18.000      0.018    5.242
190-400-0.05           13.46     289197         0.242      120.570          2.977        16.375      0.016    6.962
1100-500-0.03           6.30      71820         0.298       72.016          2.030        16.125      0.023    7.052
1100-500-0.04          38.20     858877         0.589      203.945          3.208        24.813      0.004   12.807
1100-500-0.05          76.49    2186299         0.491      292.836          4.059        24.438      0.002    9.698
1110-750-0.03        5133.82   95623639         7.539     1325.859         23.942        69.375      0.000    3.331
1110-750-0.04       10256.64  248729987        11.959     1244.086         34.212        50.250      0.000    4.539
1110-750-0.05        3239.82  112034575         3.544      861.578         13.866        38.250      0.000   11.835
1200-650-0.02-100    1476.05   12948620         1.930      692.391          9.250        39.250      0.000   11.969
1200-650-0.02-152    2240.29   24332087         2.642      819.422         13.442        44.625      0.000   14.048
1200-600-0.04        4255.59  131120162         5.959     1273.219         21.805        50.625      0.000   13.651



according to the amount of cache memory available. The considerable improvement in execution
times indicates that these bottlenecks are avoided. Further analyses of the model will be conducted
on other classes of applications. The application used here is considered dynamic; therefore,
applications with different characteristics will be considered in future work in order to show the
efficiency of the model.
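
As a rough consistency check on the reported improvement, the sketch below takes a few rows of
Tables VIII and IX and assumes that Speedup is defined as a baseline time divided by the reported
Time, so that Time multiplied by Speedup should give the same baseline in both tables. That
definition is an assumption made here, and the helper names implied_baseline and gain are
hypothetical.

```python
# Rows copied from Tables VIII and IX: instance -> (time, speedup).
FOUR_MACHINES = {
    "190-400-0.03": (4.49, 4.505),
    "1100-500-0.03": (11.83, 3.757),
    "1200-600-0.04": (9834.11, 5.907),
}
EIGHT_MACHINES = {
    "190-400-0.03": (3.70, 5.463),
    "1100-500-0.03": (6.30, 7.052),
    "1200-600-0.04": (4255.59, 13.651),
}


def implied_baseline(time, speedup):
    """Baseline time implied by the assumed definition speedup = baseline / time."""
    return time * speedup


def gain(instance):
    """Ratio of the four-machine time to the eight-machine time for one instance."""
    return FOUR_MACHINES[instance][0] / EIGHT_MACHINES[instance][0]


if __name__ == "__main__":
    for name in FOUR_MACHINES:
        b4 = implied_baseline(*FOUR_MACHINES[name])
        b8 = implied_baseline(*EIGHT_MACHINES[name])
        print(f"{name}: implied baseline {b4:.2f} vs {b8:.2f}, "
              f"gain from 4 to 8 machines {gain(name):.2f}x")
```

For these rows the two implied baselines agree to within one percent, which suggests that both
tables report speedup against the same reference run.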